How Talker Identity Relates to Language Processing
Sarah C. Creel and Micah R. Bregman
Language and Linguistics Compass 5/5 (2011): 190–204. doi: 10.1111/j.1749-818x.2011.00276.x
Abstract
Speech carries both linguistic content – phonemes, words, sentences – and talker information, sometimes called 'indexical information'. While talker variability materially affects language processing, it has historically been regarded as a curiosity rather than a central influence, possibly because talker variability does not fit with a conception of speech sounds as abstract categories. Despite this relegation to the periphery, a long history of research suggests that phoneme perception and talker perception are interrelated. The current review argues that speech perception itself may arise from phylogenetically earlier vocal recognition, and discusses evidence that many cues to talker identity are also cues to speech-sound identity. Rather than brushing talker differences aside, explicit examination of the role of talker variability and talker identity in language processing can illuminate our understanding of the origins of spoken language, and the nature of language representations themselves.

Spoken language contains a great amount of communicative information. Speech can refer to things, but it also identifies or classifies the person speaking. This dual function links language to other types of vocal communication systems, and means that recognizing speech and recognizing speakers are intertwined. Imagine someone says the word cat. The vocal pitch of 'cat' is 200 Hz, and the delay between the opening of the vocal tract and the beginning of vocal-cord vibration is 80 ms. The listener has to process all of this acoustic information to identify this word as /kæt/. Many would agree that the vocal pitch is relatively unimportant, while the 80-ms voice onset time is crucial because it distinguishes the phoneme /k/ from the very similar phoneme /g/. But what happens to the 'unimportant' information? On a modular view, information not linked to phoneme identity is discarded by the speech-processing system (e.g. Liberman and Mattingly 1985). Nonetheless, details such as vocal pitch may influence comprehension because they indicate the speaker's identity, and can refine the listener's expectations about what particular speakers are likely to talk about.

Differing theories of language processing have different accounts of talker identification. On one view, two separate, independent systems process speech and talker identity (Belin et al. 2004; González and McLennan 2007). Implicit in this two-systems claim is the assumption that acoustic cues to word identity and talker identity are completely independent of each other. On a different view, both talker identification and language processing arose from evolutionarily earlier capacities to recognize individuals from vocal cues (e.g. Owren and Cardillo 2006), and are thus intertwined. The one-system claim suggests that talker identification and language processing use the same set of memory representations, and that humans learn (during language acquisition) which elements of sound signal a difference in meaning, a difference in talker, or both. In fact, it has been known for decades that speech-sound characteristics and talker characteristics are not independent (e.g. Bricker and Pruzansky 1966; Ladefoged and Broadbent 1957).

Returning to the cat example, it turns out that knowing the speaker's vocal pitch in cat is important for identifying the vowel as /æ/. Listeners interpret formants (louder frequency regions present in vowels, which differ from vowel to vowel) differently depending on the voice's pitch (the fundamental frequency, or f0) (see e.g. Hillenbrand et al. 1995; Miller 1953). Further, listeners can use formants alone to identify talker gender (Fellowes et al. 1997), suggesting that phoneme representations and talker-specific details are closely linked.
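What might such talker-contingent interpretation involve computationally? One standard technique from phonetics, sketched below on made-up numbers, is Lobanov-style talker normalization: each talker's formants are re-expressed relative to that talker's own formant mean and spread, so that the 'same' vowel lines up across very different vocal tracts. This is our illustration, not a model proposed by the studies cited above; the formant values are rough, textbook-flavored numbers, not measurements.

```python
import numpy as np

def lobanov(formants: np.ndarray) -> np.ndarray:
    """Z-score each formant dimension within one talker's own vowel set."""
    return (formants - formants.mean(axis=0)) / formants.std(axis=0)

# [F1, F2] in Hz for the vowels /i/, /ae/, /u/ from two hypothetical talkers.
male  = np.array([[270, 2290], [660, 1720], [300, 870]], dtype=float)
child = np.array([[370, 3200], [1030, 2320], [430, 1170]], dtype=float)

# Raw formants differ by hundreds of Hz across talkers...
print(np.abs(male - child).mean())                     # large Hz differences
# ...but after within-talker normalization the same vowels line up closely.
print(np.abs(lobanov(male) - lobanov(child)).mean())   # near zero
```

The point is not that listeners literally compute z-scores, only that some talker-relative rescaling of this general kind can make a child's /æ/ and a man's /æ/ perceptually equivalent.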
The following text outlines what language scientists should know about the role of talker variability in language processing. This information can help us understand the evolutionary roots of human language – how similar is it to other animals' vocal communication systems? It also speaks to the modularity of speech processing – if speech processing were truly modular, talker information should not impinge on it. However, as we will see, talker information profoundly affects speech processing. In the first two sections below, we outline vocal recognition in animals and in human listeners. The third section considers how talker-specific sound properties affect speech recognition. The fourth section briefly outlines how talker-varying acoustic properties can constrain sentence comprehension. Finally, we summarize and discuss outstanding questions in this area.

1. Individual Recognition in Other Species

To understand where language may come from, it is important to understand vocal communication systems in other species. Discussions of language evolution often focus on the diversity of meaning that can be communicated through syntactic recombination of words. Non-human animals are far more limited in transmitting meaning by recombining sound units (see e.g. Zuberbühler 2002). How this ability arose in humans is an important scientific question. Sometimes overlooked, however, is a more straightforward parallel between speech and animal vocalizations: communicating the vocalizer's identity and physical characteristics. We focus primarily on birdsong, an animal model frequently used as an analog of human speech. Vocal learning in humans and birds is a clear case of convergent evolution, since our common ancestor, some 300 million years ago, is also shared with many non-vocal-learning animals (including reptiles). We also discuss acoustic cues to identity in non-human primates, whose communication systems may be homologous to human speech.

Birdsong is a natural acoustic communication signal used by about 4,500 species of oscine birds (songbirds) to attract mates, claim territory and recognize individual species members (reviewed by Kroodsma and Miller 1996). The songbird auditory system is highly specialized for the perception of song, and physiological study of auditory brain regions has revealed neurons tuned to represent song (reviewed by Bolhuis and Gahr 2006). Birdsong is an important animal model of vocal learning, and is the best-studied non-human analog to speech, primarily because its development parallels the development of human speech perception and production (Doupe and Kuhl 1999). Unlike our nearest primate relatives, who do not learn their vocalizations (though a few mammals do, e.g. whales, dolphins and elephants; Janik and Slater 1997), songbirds are prolific vocal learners; many species learn new songs throughout life.
If isolated or deafened during development, songbirds do not produce normal song (Konishi 1963; Marler and Tamura 1964; reviewed by Brenowitz et al. 1997), and continued auditory feedback is required for normal song to be maintained (Nordeen and Nordeen 1992).

Both birdsong and speech are used by species members to acquire information about the individual producing the vocalization. This information may include cues to an individual's sex, health and reproductive state, and can often be used to precisely identify another individual (Falls 1982). In speech, both characteristics of the vocal apparatus and utterance content (one's choice of words) can conceivably be used to recognize talkers. Similarly, across bird species, vocal-tract characteristics (Weary and Krebs 1992) and song content (Beecher et al. 1994; Gentner and Hulse 2000; see Figure 1) have been implicated in individual recognition in upward of 130 different bird species (Stoddard 1996).

Different species use different acoustic features for recognition. For example, song sparrows can use song content both to discriminate between the songs of neighbors and strangers, and to recognize the songs of individual neighbors (Beecher et al. 1991); they are thought to memorize the song types of individual singers to perform this discrimination. European starlings are lifelong vocal learners who sing complex songs that often last longer than a minute. These songs comprise repeated units called motifs (Figure 1), and laboratory studies suggest that starlings learn to associate particular motifs with individual singers (Gentner and Hulse 2000).

Fig 1. Starling song (courtesy of T. Gentner). Starlings may recognize each other by the motifs they produce in their songs. Red brackets mark two instances of the same motif.

Individual birds' songs are often highly distinct, and this variability is important for successful individual recognition. Because of memory limitations, birds are likely to concentrate their resources on learning to identify critical individuals (mates, parents). Several avian species are monogamous (at least through a single breeding season), making mate recognition an important function of song. Studies in several species have shown that birds can recognize their mate's song even in the context of the songs of many other individuals (Lind et al. 1996; Miller 1979b). Similarly, female zebra finches, which may not recognize large numbers of individuals, can recognize their fathers' songs even after two months of separation (Miller 1979a).

Primate vocalizations are also of interest, because primates are our closest living relatives. Mechanisms of vocal production, described more thoroughly in Section 2, are similar among primate species (including humans). Many primate species produce calls with individually distinctive acoustic features, including rhesus macaques, Barbary macaques, squirrel monkeys, vervet monkeys and others (Cheney and Seyfarth 1980; Hammerschmidt and Todt 1995; Rendall et al. 1996; Rendall et al. 1998). Rendall et al. (1998) mathematically analyzed the acoustics of macaque calls, and determined that at least one call type, coos (Figure 2), provided acoustic cues reliable enough to discriminate among female rhesus macaques (Macaca mulatta) with 79.4% accuracy.
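To give a feel for what such an analysis involves, the sketch below runs a cross-validated discriminant classification of simulated calls, asking whether a small acoustic feature set carries enough information to tell individual callers apart. It is our illustration on made-up data; the features, parameters and numbers are not Rendall et al.'s.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulate coos from 5 females: each individual has her own mean for three
# acoustic features (mean f0 in Hz, call duration in s, formant dispersion
# in Hz), plus call-to-call variability around that mean.
n_indiv, calls_per = 5, 40
means = rng.normal([250.0, 0.4, 1000.0], [30.0, 0.08, 120.0], size=(n_indiv, 3))
X = np.vstack([rng.normal(m, [15.0, 0.05, 60.0], size=(calls_per, 3)) for m in means])
y = np.repeat(np.arange(n_indiv), calls_per)

# Cross-validated discriminant classification: can the caller be identified
# from the feature set at rates well above the 20% chance level?
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
print(f"caller identification accuracy: {acc:.1%}")
```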
In 'playback' experiments, researchers assess whether animals actually use these acoustic identity cues by presenting recordings of other animals' vocalizations from audio speakers in the animals' habitat, and measuring their behavioral response. Reliable call discrimination is evident if animals respond differently to different animals' vocalizations. Such experiments suggest that several primate species distinguish individuals by their calls, including forest monkeys (e.g. Waser 1977), rhesus macaques (Rendall et al. 1996), and baboons (Bergman et al. 2003). Baboons even recognize the kin relationships of other baboons (Cheney and Seyfarth 1999).

Primates may use vocal cues to assess other attributes as well (age, sex, body size; reviewed in Ey et al. 2007). Among several primate species, common predictive cues include: mean (average) f0, f0 range, f0 variability, dispersion of power throughout the spectrum, mean gap length (a gap being a section of a call with low amplitude), the frequency in the spectrum with the highest amplitude, and call duration (Ey et al. 2007). F0 has often been suggested as a useful predictor of body size; however, in primates (including humans) the correlation seems weak once age is accounted for (e.g. Lass and Brown 1978; Perry et al. 2001; though see Krauss et al. 2002). Formant frequency dispersion (the frequency distance between vocal-tract resonances; Fitch 1997; Ghazanfar et al. 2007) provides better information, at least in macaques. As we will see below, formants may also be important for identifying human voices (e.g. Fellowes et al. 1997).

Fig 2. Macaque coo (courtesy of A. Ghazanfar). Small horizontal bands with a slight inverted-U contour are individual harmonics (integer frequency multiples) of the fundamental frequency; the thicker, dark horizontal bands, highlighted with red lines, are formants. Macaques may use formant dispersion (the distance between formants) to identify each other.
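Formant dispersion has a simple operational definition: the average spacing between adjacent formant frequencies, which decreases as the vocal tract gets longer. A minimal sketch, using illustrative round numbers for two idealized uniform-tube vocal tracts rather than real measurements:

```python
import numpy as np

def formant_dispersion(formants_hz):
    """Mean spacing between adjacent formants: (F_n - F_1) / (n - 1)."""
    f = np.sort(np.asarray(formants_hz, dtype=float))
    return np.diff(f).mean()

long_tract  = [500, 1500, 2500, 3500]   # longer vocal tract: narrow spacing
short_tract = [700, 2100, 3500, 4900]   # shorter vocal tract: wide spacing
print(formant_dispersion(long_tract))   # 1000.0 Hz
print(formant_dispersion(short_tract))  # 1400.0 Hz
```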
Vocal communication systems are enormously diverse among species. Despite this diversity, vocalizations allow listeners to acquire information about the producer. This aspect is often overlooked in human language, where the focus is on how speech communicates arbitrary meaning rather than information about individuals. Human vocalizations may be homologous (evolutionarily related) to primate vocalizations in terms of the sounds produced and recognized (formants, not motifs), but human vocalizations are analogous to birdsong in that they are flexibly learned. We now examine the vocal cues human listeners use to identify talkers. As with primates, the acoustic cues underlying individual recognition in humans are not fully understood, although there has been substantial progress.

2. What Acoustic Attributes Vary Between Talkers?

As the reader likely observes regularly, many acoustic characteristics vary among talkers. These can be divided into 'source' characteristics – properties arising from the vibration of the glottis (the vocal cords or, more accurately, vocal folds) – and 'filter' characteristics – sounds created by dynamically changing the shape and size of the vocal tract, particularly the mouth.

Source and filter are largely independent. The source can change while the filter stays put: the 'Aaaah!' of satisfaction maintains filter characteristics (the mouth shape that produces /a/) while changing source characteristics (the pitch drops). The filter can also change without the source, as in the 'Lalalalala' of someone blocking out speech they do not want to hear: the source (pitch) stays the same, while the filter changes rapidly (the tongue tip moving alternately up and down). Filter characteristics for vowels are described in terms of formants – frequency regions of the source sound that are amplified for a given vowel (Figure 3; also evident in the macaque coos of Figure 2). Formants are numbered in order of increasing frequency (F1 is lowest, then F2, etc.); f0 is lower than all formants. Ladefoged's classic A Course in Phonetics (2006) is an excellent introduction to these topics.

Numerous source properties differ across talkers, including f0, f0 variability, and different types of vocal-fold vibration such as creaky voice and breathy voice. All of these characteristics indicate differences in meaning in some languages, and in certain prosodic or phonetic contexts in other languages (see Gordon and Ladefoged 2001 for more details). Thus, while the filter produces most speech-sound contrasts, the source can also produce contrasts. This runs counter to a common view (which, as we shall see below, is incorrect) that talker identity resides in the source and word identity in the filter. Other source characteristics include jitter (micro-variations in f0) and shimmer (micro-variations in loudness), both of which, in normal speakers, decrease as vocal loudness increases (Brockmann et al. 2008).

Filter properties also differ across talkers: different talkers produce different acoustic realizations of many phonemes (vowels are addressed here, but consonants are also affected; see for example Allen et al. 2003; Newman et al. 2001). Variability exists between accents (Bradlow and Bent 2008; Clopper et al. 2005), genders (Simpson 2009), and individual speakers (Fellowes et al. 1997).

Fig 3. Speakers from Hillenbrand et al. (1995) producing the vowels /a/ and /u/. (a) Male speaker; (b) female speaker; (c) female child. Dark horizontal bars, highlighted with red lines for the first two vowels, are formants. Note the differences in formants between vowels, and between the three speakers. Measurements and sound files are available at Hillenbrand's web site.

Gender differences may be partly socially conditioned. In children, vocal-tract properties differentiate around puberty (Fitch and Giedd 1999). However, even before puberty, female children show higher formant frequencies than males (Figure 4a), even though f0 does not yet differ (Figure 4b; Perry et al. 2001). This suggests that learned 'gender dialects' (Johnson 2006) may generate differences in male vs. female speech patterns.

Fig 4. F2 vs. F1 (a) and f0 (b) for male and female children and adults in Hillenbrand et al. (1995). Children do not differ by gender in f0, but do differ in F1 and F2; adults differ in all three. Children's ages are reported as between 10 and 12, which is likely (but not certainly) prepubescent. ***p < 0.0001.

Finally, though not emphasized here, longer-time-scale properties like f0 variability (prosody) and timing characteristics may also distinguish talkers.

Which of these numerous acoustic properties do listeners use to identify voices? One approach to this question presents several voices (usually 20–40) to listeners, who rate voice similarity or try to identify the voices
(e.g. Baumann and Belin 2010; Bricker and Pruzansky 1966; Goldinger 1996; Kreiman et al. 1992; Murry and Singh 1980; Singh and Murry 1978). If two voices are often confused, or rated as very similar, what attributes do they share? If rarely confused, or rated as very different, how do they differ?

Bricker and Pruzansky (1966) explored whether listeners use acoustic attributes independent of speech sounds to identify talkers. This would allow talker recognition and speech recognition to proceed independently, as in two-system models. They presented pairs of vocalizations (e.g. /i/ from Talker 1 and /i/ from Talker 14) and asked listeners to say whether the same person or different people had produced them. Among other findings, voices that were confused on one vowel (/i/) were not necessarily confused on another vowel (/u/). This suggests that vowel information and talker-identifying information are not processed independently.

Nonetheless, researchers have continued to search for talker-relevant acoustic properties. The most reliable cue, emerging across all studies, is f0. Others include formant frequencies (Baumann and Belin 2010; Murry and Singh 1980), hoarseness (Murry and Singh 1980; Singh and Murry 1978), vowel duration (Murry and Singh 1980; Singh and Murry 1978), and shimmer (Kreiman et al. 1992). However, other than f0, there is little consistency in the cues that are uncovered.

A second approach to studying talker identification assesses recognition accuracy when certain cues are removed or distorted. If recognition is impaired, that cue is probably important to recognition. Van Lancker et al. (1985a,b) studied famous-voice recognition under different types of distortion (forward or backward playback, altered duration). Each type of distortion impaired recognition of some voices but not others. Remez et al. (1997; Fellowes et al. 1997) removed f0 by presenting sinewave analogs of the talkers' formants, and found that gender and individual identity could be identified without f0. Fellowes et al. suggested that formant frequencies alone sufficed for identifying gender, while timing characteristics were also important for identifying individuals.

This is admittedly an inconsistent picture of talker identification cues. Van Lancker et al. (1985a) suggest that 'the critical parameter(s) [for recognition] are not the same for all voices ... any of these cues may serve as a "primary" cue at one time but not another' (33). That is, many cues potentially distinguish talkers, but certain cues only distinguish a small fraction of talkers, and less common cues are unlikely to show up when only a few voices are studied. Studies may also reflect inconsistency among individual listeners (Kreiman et al. 1992). The best current account is that listeners can exploit myriad acoustic cues to talker identity, and that at least some of these cues also function in speech-sound identification.
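Two of the source measures mentioned in this section, jitter and shimmer, have simple operational definitions. The sketch below is our illustration, following the common 'local' definitions (mean absolute cycle-to-cycle change, as a fraction of the mean), and assumes that glottal cycle lengths (periods) and per-cycle peak amplitudes have already been extracted from a recording, for example by a pitch tracker.

```python
import numpy as np

def jitter(periods_s: np.ndarray) -> float:
    """Local jitter: mean absolute difference between consecutive glottal
    periods, as a fraction of the mean period (micro-variation in f0)."""
    return np.abs(np.diff(periods_s)).mean() / periods_s.mean()

def shimmer(amplitudes: np.ndarray) -> float:
    """Local shimmer: mean absolute difference between consecutive cycle
    amplitudes, as a fraction of the mean (micro-variation in loudness)."""
    return np.abs(np.diff(amplitudes)).mean() / amplitudes.mean()

rng = np.random.default_rng(1)
periods = 1 / 200 + rng.normal(0, 5e-5, 100)  # ~200 Hz voice with slight f0 wobble
amps    = 1.0 + rng.normal(0, 0.03, 100)      # slight cycle-to-cycle loudness wobble
print(f"jitter  = {jitter(periods):.2%}")
print(f"shimmer = {shimmer(amps):.2%}")
```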
The next section discusses how this overlap between talker information and speech information affects speech processing.

3. How Does Talker Variability Affect Speech Processing?

Given that talkers vary in attributes that distinguish speech sounds, it is not surprising that some acoustic correlates of talker identity affect speech perception. Talker information can affect speech-sound identification both negatively and positively, depending on the experimental task. When the talker changes rapidly during an experiment, speech recognition becomes more difficult. However, when talker information is consistent – for instance, when a talker who said a word once says it again – recognition may be facilitated. These will be referred to as talker interference effects and talker-specificity effects, respectively.

3.1. TALKER INTERFERENCE EFFECTS

Rapid talker variation interferes with speech-sound identification. Mullennix and Pisoni (1990), Nusbaum and Morin (1992), and Green et al. (1997) demonstrated this using interference paradigms. In each case, listeners classified a speech sound (as, say, /b/ or /p/) in a set of words where the talker varied unpredictably, or where the talker varied predictably (it stayed the same or was correlated with the speech-sound dimension). If listeners could attend just to the phonemes, unpredictable talker variation should not distract them. However, listeners were slower and less accurate at identifying speech sounds when talkers varied unpredictably, suggesting that they could not attend to phonemic information alone.

Memory for word lists (dog, shoe, cup, bike...) also shows talker interference. Listeners recall fewer words from multiple-talker lists than from single-talker lists (Martin et al. 1989). Those authors suggest that cognitive capacity for encoding words is taken up by adjusting to a different talker on each word. Interestingly, this effect reverses when the time between words is increased (Goldinger et al. 1991), suggesting that more time allows listeners to use talker differences to encode words in memory more richly.

The reason for listeners' perceptual difficulty when the talker changes rapidly seems to be the perceptual or attentional adjustment required, often called talker normalization. In an early demonstration, Ladefoged and Broadbent (1957) altered a lead-in phrase, 'Please say what this word is', shifting its formants either higher or lower than the original. They presented identical word recordings (such as 'bit') after the different versions of the lead-in. Listeners' perception of the vowel in 'bit' changed depending on the lead-in's formants. This suggests that the lead-in calibrated listeners to that speaker's vowel space, consistent with Joos's (1948) early account of talker normalization.
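A toy sketch of this calibration logic (ours, not Ladefoged and Broadbent's model): the same recorded first formant is mapped to different vowels depending on the F1 range established by the lead-in phrase. All values are illustrative numbers in Hz, and the binary /ɪ/–/ɛ/ decision is a deliberate oversimplification.

```python
import numpy as np

def perceive_vowel(target_f1, context_f1s):
    """Interpret the target F1 relative to the midpoint of the talker's
    F1 range, as estimated from the lead-in phrase."""
    midpoint = np.mean(context_f1s)
    return "/ɪ/ as in 'bit'" if target_f1 < midpoint else "/ɛ/ as in 'bet'"

target = 450.0  # the identical test-word recording
print(perceive_vowel(target, [300.0, 400.0, 500.0]))  # lowered lead-in -> 'bet'
print(perceive_vowel(target, [450.0, 550.0, 650.0]))  # raised lead-in  -> 'bit'
```

The same acoustic event is heard as different vowels because the listener's frame of reference, not the signal, has changed.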
Effects of normalization are widely acknowledged, but their underlying causes are still debated. One idea is that normalization represents not just adjustment to incoming information, but activation of expectations. Johnson (2005; Johnson et al. 1999) describes studies in which the talker's apparent gender influences the perception of speech sounds (see Niedzielski 1999, and Staum Casasanto 2008, for dialect effects). This suggests that listeners normalize to a memory representation of one group of talkers versus another. Magnuson and Nusbaum (2007) even found that listeners who were told they would hear two talkers in a vowel-identification task performed more slowly than those told they would hear one. Thus, normalization may involve the expectation that one must reorient auditory attention.

3.2. TALKER-SPECIFICITY EFFECTS

Distinct from talker interference effects, which presumably result from normalization, are talker-specificity effects: processing is better when talker information remains constant from one presentation to the next. For instance, listeners understand familiar talkers better than unfamiliar talkers in noise (Nygaard et al. 1994), suggesting that familiarity with talker-specific acoustics facilitates word recognition. Palmeri et al. (1993) presented listeners with a series of words and non-words, each spoken by a particular talker. In a later phase, they presented a list consisting of words from the first list (50%) and new words (50%), and listeners made old-word/new-word judgments. Crucially, old words were spoken either by the original talker or by a new talker. Listeners were more accurate at responding 'old' when the original talker spoke the word again (see Schacter and Church 1992, and Church and Schacter 1994, for related work).

At what level is talker-specific information represented? Creel et al. (2008), in a visual-world eye-tracking experiment, found evidence for storage of talker-specific instances of whole words. Other authors have found effects at the sublexical level: Eisner and McQueen (2005; see also Kraljic and Samuel 2005) found talker-specific adaptation to boundary shifts between fricative categories (/f/ vs. /s/) that generalized to words not heard with the shifted phoneme. This suggests that phoneme representations, not (just) word representations, are affected by talker-specific information (though see Kraljic and Samuel 2006).

3.3. LEARNING AND DEVELOPMENTAL EFFECTS

A final important aspect of talker variability is how it affects language learning. Infants seem to form word representations that are too acoustically specific (e.g. Houston and Jusczyk 2000). Infant recognition is assessed by how long infants look at a sound-producing source: if looking times differ for a familiar sound vs. an unfamiliar one, we can conclude that infants recognize the familiar sound. By this measure, 7.5-month-olds cannot distinguish a familiarized word from an unfamiliarized one when the words are spoken by a new, dissimilar talker (Houston and Jusczyk 2000), though by 10.5 months they can (for parallels in vocal emotion, see Singh 2008 and Singh et al. 2008; in accent, Schmale and Seidl 2009 and Schmale et al. 2010). Conversely, infants are more accurate at identifying words when they are 'taught' (exposed to) the words in a variety of voices (Rost and McMurray 2009, 2010) or vocal emotions (Singh 2008). Similar benefits of multi-talker learning are evident in L2 speech-sound acquisition (Lively et al. 1993; Logan et al. 1991), perceptual attunement to non-natively accented speech (Bradlow and Bent 2008), and learning of unfamiliar dialects (Clopper and Pisoni 2004). These studies suggest that new listeners have difficulty generalizing across talker variability, and that this difficulty diminishes with exposure to a wider range of talkers.
One explanation for this effect is that listeners must work out the dimensions that matter for identification (F1 indicates meaning differences in my language, while creaky voice does not): with more exposure, learners increasingly tune out cues irrelevant to word identity. This account predicts that adults should be less adept than children at identifying voices, because adults have tuned out much talker variability. However, evidence suggests that adults are better than children at discriminating and identifying talkers (Bartholomeus 1973; Mann et al. 1979), inconsistent with the idea that adults globally tune out talker information. An alternative explanation is that learners get better at tuning in to different dimensions of speech variability, contingent on the listening situation (identifying words vs. identifying talkers). Goldinger (1996, 1998) suggests that word and talker information reside in the same exemplar-style representations (Hintzman 1986; Nosofsky 1989) – mental recordings of every speech instance the listener has ever experienced. Storing them all would allow clusters of information – speech sounds, talkers – to emerge naturally, and different clusterings could be selected by directing attention to particular acoustic attributes (see Nosofsky 1989; Werker and Curtin 2005). For instance, attention to creaky voice might be higher when trying to recognize a talker than when trying to recognize a word.
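A schematic sketch of this idea (ours, in the spirit of Nosofsky's generalized context model rather than Goldinger's actual implementation): a single store of exemplars, each labeled with both a word and a talker, supports either word identification or talker identification depending on which acoustic dimensions the attention weights emphasize. The cue values and labels below are made up for illustration.

```python
import numpy as np

# Each exemplar is a point in acoustic-cue space with both a word label and
# a talker label. Columns: [VOT-like cue, creaky-voice-like cue].
exemplars = np.array([[0.8, 0.2],   # "cat", talker A
                      [0.8, 0.9],   # "cat", talker B
                      [0.1, 0.2],   # "gap", talker A
                      [0.1, 0.9]])  # "gap", talker B
words   = np.array(["cat", "cat", "gap", "gap"])
talkers = np.array(["A", "B", "A", "B"])

def classify(probe, labels, attention, c=5.0):
    """Pick the label whose exemplars are most similar to the probe, with
    similarity decaying exponentially in attention-weighted distance."""
    d = np.sqrt(((exemplars - probe) ** 2 * attention).sum(axis=1))
    sim = np.exp(-c * d)
    return max(set(labels), key=lambda lab: sim[labels == lab].sum())

probe = np.array([0.75, 0.85])
print(classify(probe, words,   attention=np.array([1.0, 0.0])))  # attend VOT          -> "cat"
print(classify(probe, talkers, attention=np.array([0.0, 1.0])))  # attend voice quality -> "B"
```

Changing only the attention vector, not the memory store, flips the same probe from a word decision to a talker decision, which is the sense in which one set of representations can serve both tasks.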
4. Knowledge-based Influences of Talker Information on Speech Processing

Finally, talker information is useful at the sentence or discourse level. Speech provides rich information about the talker's attributes (Ladefoged and Broadbent 1957; femininity, Ko et al. 2009; sexual orientation, Pierrehumbert et al. 2004; physical size, Krauss et al. 2002), and this information can be used to predict what the talker is likely to say. Some researchers have argued that activating social knowledge along with speech subtly alters the speech's meaning (Geiselman and Bellezza 1976, 1977; Geiselman and Crawley 1983). Geiselman and Bellezza (1977), taking talker gender as a test case, found that listeners rated printed sentences as more 'potent' when men had previously spoken them than when women had. One hopes this effect would be moderated somewhat 30 years later. Dimming these hopes, Ko et al. (2009) found that judgments of a talker's perceived competence (at carrying out a hypothetical job the talker had applied for) were strongly related to vocally based judgments of femininity.

Adults rapidly invoke talker-related knowledge in processing sentences. Using electroencephalography, Van Berkum et al. (2008) found that listeners' predictions of upcoming words in spoken sentences differed by speaker. Previous work (see Kutas and Hillyard 1980, 1984) shows that when listeners hear a sentence like 'I like coffee with ___' that ends with an unexpected word (e.g. 'dog'), the brain generates a particular electrophysiological response – an N400 – which does not appear for listeners who hear an expected completion ('sugar'). Van Berkum et al. looked for the N400 in sentences such as 'I like to drink wine'. Listeners who heard an adult speaking showed a smaller N400 to 'wine' than listeners who heard a child, suggesting that listeners interpret spoken material contingent on vocally perceived social attributes of the talker.

There is also a developmental aspect to learning the social correlates of talker variability. Neonates (DeCasper and Fifer 1980) and even fetuses (Kisilevsky et al. 2003) respond differently to their mother's voice than to a stranger's voice, suggesting prenatal encoding of talker characteristics. Kinzler et al. (2007) showed that infants prefer to look at faces associated with their native language, and that 5-year-olds prefer to befriend native-language or native-accent speakers over non-native ones (see also Hirschfeld and Gelman 1997). These studies suggest that children use acoustic familiarity to evaluate talkers socially. Going beyond effects of familiarity, Creel (2010) found evidence for specific, high-level mappings of voice attributes to individuals: preschool-aged children who knew two talkers' favorite colors used talker information early in an instruction sentence ('Can you help me find the square?') to visually fixate shapes of the talker's preferred color, before hearing the shape name itself. Crucially, they did so only when the talker asked for herself, not when she asked on behalf of the other talker. This suggests that, at least in simple cases, children use talker information to constrain sentence interpretation as adults do. It is less clear whether children can learn or use subtler or more complex information.
5. General Discussion

In this review, we have explored talker identification and how it relates to language processing. We discussed its similarities to within-species individual recognition in songbirds and primates, and outlined attempts to characterize the acoustic cues to human talker identity. We then described two levels at which talker-related acoustic cues affect language processing: the phonological level and the sentential level. We close by outlining several interesting open questions regarding talker variability and speech processing.

First, how closely is language related to individual recognition in primates? Among primates, only humans learn their vocalizations. Did language emerge from non-human primate vocalizations? We cannot answer this definitively. However, one useful observation concerns bird vocalizations, which, though acoustically very different from primate vocalizations, may provide an evolutionary parallel: not all vocalizing bird species learn their vocalizations (e.g. chickens; Konishi 1963). This implies that recognizing non-learned vocalizations is a necessary evolutionary precursor to more flexible vocal learning (Dubbeldam 1998). That is, recognition of non-learned vocalizations in primates may have formed an evolutionary platform for human vocal learning – humans are the 'songbirds' of the primates.

Another question is what, other than phoneme-varying cues, distinguishes talkers. Recall that some sound properties that vary idiosyncratically by talker in one language vary phonemically in another (Gordon and Ladefoged 2001). This suggests that many, if not all, cues can function dually to identify phonemes and talkers. Distinguishing talkers may depend on attunement to the cues that differentiate the talkers in one's own range of experience (see Perrachione et al. 2010 for supporting evidence). Overall, the degree of overlap between talker-identifying cues and speech-relevant cues, and listeners' apparent inability to process speech information and talker information separately, suggest that language is not a module impenetrable to acoustic variability.

A final question, also related to modularity, is how talker and speech information are stored: as the output of two separate systems, as part of a single set of representations (e.g. collections of all experienced speech), or both? Three pieces of evidence are relevant. First, as discussed above, many talker-varying acoustic properties also identify speech sounds, suggesting that the same representations may serve both talker and speech-sound identification. Second, studies of talker-specificity effects show that when listeners learn to recognize a particular talker's unusual phoneme (e.g. an /s/ that is ambiguous between /f/ and /s/), they adjust their recognition of other words containing that phoneme (e.g. Eisner and McQueen 2005; Kraljic and Samuel 2005). That is, talker-specific properties affect the recognition of speech sounds. If talker information were stored separately from speech-sound information, this should not happen: a distorted phoneme would be useful for identifying the talker as that individual, but would not aid recognition of speech from that talker. Further, listeners would adapt across the board to the shifted phoneme, rather than to a specific talker's productions of it (Eisner and McQueen 2005; Kraljic and Samuel 2005; though see Kraljic and Samuel 2006). Finally, as noted in the Introduction, some neuroimaging evidence suggests two separate systems, with speech sounds analyzed in the left hemisphere and talker identity in the right (Belin et al. 2004; González and McLennan 2007). Other studies suggest left-hemisphere contributions to talker recognition (Perrachione et al. 2009) and environmental sound recognition (Saygin et al. 2003). Why the discrepancies? Results may differ across studies in the temporal scale of the acoustic information used to identify speech sounds vs. talkers (for instance, voice onset time vs. prosody). The left hemisphere seems biased toward rapid temporal events while the right processes slower events (Zatorre and Belin 2001), so right-hemisphere activation may reflect the use of slower temporal scales rather than specialization for talker identification. That is, listeners distinguish speech sounds by short-time-scale properties, but tend to distinguish voices – especially unfamiliar ones – by slower-time-scale properties. The information to be identified (talker vs. speech) and expertise or familiarity with the voices (see Perrachione et al. 2009) may affect the temporal scale used, with distinctions among more familiar voices being made from more temporally fine-grained information.

In sum, much research indicates that talker-specific acoustic patterns are related to, and impact, language processing. However, there is much to learn about how listeners represent talker-identifying information, and what the real-world ramifications are for language processing. We hope the reader is intrigued and tantalized by the prospect of new developments in this area.